Abstract

Web usage mining is a type of web mining, which exploits data mining techniques to 
discover valuable information from navigation behavior of World Wide Web users. 
The first phase of web usage mining is the data processing phase, which includes the 
session reconstruction operation from server logs. Session reconstruction success 
directly affects the quality of the frequent patterns discovered in the next phase. 
In reactive web usage mining techniques, the source data is web server logs and
the topology of the web pages served by the web server domain. Other kinds of
information collected during the interactive browsing of the web site by the
user, such as cookies or web logs containing similar information, are not used.
The next phase of web usage mining is discovering frequent user navigation
patterns. In this phase, pattern discovery methods are applied to the
reconstructed sessions obtained in the first phase in order to discover frequent
user patterns. In this paper, we propose a frequent web usage pattern discovery
method that can be applied after the session reconstruction phase. In order to
compare the accuracy of the session reconstruction phase and the pattern
discovery phase, we have used an agent simulator, which models the behavior of
web users and generates web user navigation as well as the log data kept by the
web server.
1 Introduction



The goal in web mining [6] is to discover and retrieve useful and interesting 
patterns from a large dataset. The source data for web mining contains 
various information sources in different formats. Web usage mining (WUM) 
[25] is a new research area which can be defined as a process of applying data 
mining techniques to discover interesting patterns from web usage data. 
Web usage mining provides information for better understanding of server 
needs and web domain design requirements of web-based applications. Web 
usage data contains information about the identity or origin of web users 
with their browsing behaviors in a web domain. Web pre-fetching [13,19], 
link prediction [12,9,1], site reorganization [21,24] and web personalization 
[14,15,16,18] are common applications of WUM. 

WUM data contains users' navigation behaviors on the web. Navigation 
among web pages by using hyperlinks is the most common action of the 
web user. Two web pages can be accepted as related to each other if both of 
them are accessed in the same user session such that the first page accessed is 
connected to the second one with a hyperlink. In order to support the claim 
about two pages being related, such accesses must occur several times. 
Therefore, in WUM, first user navigation sessions must be reconstructed 
from server access logs, and then, frequent patterns in these sessions must 
be searched. 

Reconstruction of accurate user sessions from server access logs is a
challenging task since the access log protocol is stateless and connectionless.
In addition, for reactive strategies, all users behind a proxy server will have
the same IP number. Moreover, caching performed by the clients' browsers and
proxy servers will affect the web log data. These problems can be handled
by proactive strategies that use cookies and/or Java applets. However, these
solutions may be disabled by some clients because of security/privacy concerns.
In such cases proactive strategies become unusable. Reactive and proactive
session reconstruction strategies use different data sources. Proactive
strategies [10,20] use raw data collected during run-time, which is usually
supported by dynamic server pages, whereas in reactive strategies [7,8,22]
server logs are the main data source. Reactive strategies are mostly applied to
static web pages. Because the content of dynamic web pages changes over time, it
is difficult to predict the relationship between web pages and obtain meaningful
navigation path patterns. Therefore we restrict our work to static web pages. As
stated above, server logs are the main data source of reactive strategies. The
information required to obtain session information is the user's IP address,
the access date and time, and the URL of the page accessed. These three
attributes are included in the common log format (footnote 1).
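As an illustration of how the three attributes can be read from a log, the
following minimal Python sketch parses one line in the common log format. The
function name `parse_clf_line`, the regular expression, and the sample line are
assumptions made for this example only; they are not part of the method
described in this paper.

```python
import re
from datetime import datetime

# Common log format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_clf_line(line):
    """Return (ip_address, timestamp, url) for one log line, or None if malformed."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    # Example timestamp: 10/Oct/2000:13:55:36 -0700
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), timestamp, match.group("url")

# Hypothetical log line used only for illustration.
sample = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_clf_line(sample)
```

Only the host, timestamp, and URL fields are kept, since these are the
attributes a reactive heuristic needs; the remaining fields are matched but
ignored.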



There are several previous works related to mining web access patterns
[8,11,17,25]. We use a modified apriori technique adapted for sequence
discovery for discovering frequent access paths. This idea is not new [3,11,17];
however, to the best of our knowledge, the use of web topology for extending
the large itemsets through iterations of the apriori technique is novel.
In this paper, we not only show that the discovery of frequent maximal
navigation patterns from already reconstructed sessions utilizing the web
topology can be done very easily, but we also show that the accuracy of
the discovered frequent patterns is much higher than the accuracy of the
reconstructed sessions. Therefore, it is worthwhile to make extra effort to
increase the accuracy of the reconstructed sessions.



The main aim of our work is to discover frequent user session patterns. The
results of this work can be used in applications such as web pre-fetching.
The problem of which page will be requested from the current page can be
solved by applying some statistical methods to the frequent pattern set
generated by our method. In addition to web pre-fetching, the link topology can
be modified by examining frequent patterns. Reaching popular pages in frequent
patterns can be made easier by changing the link topology. The length of
the most frequent navigation paths can be decreased by analyzing the frequent
patterns discovered by our method. By changing the link topology, web
users' searches for target pages become easier.



This paper is organized as follows. The next section is dedicated to the
session reconstruction operation. It first summarizes previously used reactive
heuristics and a recently proposed heuristic. After that, it introduces the
agent simulator that was used to evaluate different session reconstruction
heuristics, and finally it experimentally evaluates the accuracy of the first
phase. Section 3 discusses pattern discovery from the reconstructed sessions,
first by introducing a modified apriori technique used for pattern
discovery, and then by analyzing the performance of the pattern discovery phase.
Finally, we give our conclusions.



1 http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format
2 Session Reconstruction

2.1 Previous Heuristics

Previous reactive session reconstruction heuristics [23] use page access
timestamps and navigation information of the users. Time-oriented heuristics
[7,22] are based on time limitations on total session time or page-stay
time. In the first type, the total time of a session cannot be greater than a
predefined threshold. In the second type, a predefined threshold is used for
checking page-stay time. Time-oriented heuristics lack path information
since they do not consider page connectivity.
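The two time-oriented heuristics can be sketched in a few lines of Python. This
is only an illustrative sketch: the function name `split_by_time`, its
parameters, and the sample requests are assumptions for this example, and
timestamps are given in minutes for simplicity.

```python
def split_by_time(requests, max_stay=None, max_duration=None):
    """Split one user's time-ordered (timestamp, page) requests into sessions.

    max_stay:     page-stay threshold between two consecutive requests
    max_duration: threshold on the total duration of a session
    A threshold set to None is not checked.
    """
    sessions, current = [], []
    for timestamp, page in requests:
        if current:
            stay_exceeded = (max_stay is not None
                             and timestamp - current[-1][0] > max_stay)
            duration_exceeded = (max_duration is not None
                                 and timestamp - current[0][0] > max_duration)
            if stay_exceeded or duration_exceeded:
                sessions.append([p for _, p in current])
                current = []
        current.append((timestamp, page))
    if current:
        sessions.append([p for _, p in current])
    return sessions

# Hypothetical requests of one client: (minutes since first request, page).
requests = [(0, "A"), (5, "B"), (20, "C"), (22, "D")]
by_stay = split_by_time(requests, max_stay=10)
by_duration = split_by_time(requests, max_duration=21)
```

Note that neither variant looks at hyperlinks, which is exactly the lack of path
information mentioned above.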

The navigation-oriented approach [7,8] takes the web topology in graph format.
It considers web page connectivity; however, it does not require a hyperlink
between two consecutive pages. In case of a missing link, backward
browser movements are inserted if one of the previously accessed pages
refers to the new page. These artificially inserted backward browser movements
are a major problem of navigation-oriented heuristics: the rest of the session
always corresponds to forward movements in the web topology graph, so the
resulting patterns are difficult to interpret, and the sequence of pages
actually requested from the server side cannot be extracted. In addition, the
extra backward movements make sessions longer. There is also no time
limitation, so for a client whose accesses are spread over very different
times, the session becomes very long.



2.2 Smart-SRA 

Smart-SRA [5,4] is a new method proposed by us for overcoming the deficiencies
of time- and navigation-oriented heuristics. Smart-SRA produces sessions
containing sequential pages accessed from the server side, where each session
$[P_1, \ldots, P_n]$ satisfies the following rules:

Timestamp Ordering Rule:

• $\forall i : 1 \le i < n,\ \mathrm{Timestamp}(P_i) < \mathrm{Timestamp}(P_{i+1})$

• $\forall i : 1 \le i < n,\ \mathrm{Timestamp}(P_{i+1}) - \mathrm{Timestamp}(P_i) \le \rho$ (page stay time)

• $\mathrm{Timestamp}(P_n) - \mathrm{Timestamp}(P_1) \le \delta$ (session duration time)

Topology Rule:

• $\forall i : 1 \le i < n$, there is a hyperlink from $P_i$ to $P_{i+1}$

Fig. 1. An example web site topology graph

Smart-SRA uses the page-stay and session duration rules of time-oriented
heuristics, and it uses the topology rule as in navigation-oriented heuristics.
It can be regarded as an improved version of combined time- and
navigation-oriented heuristics since it performs path completion and separation
more intelligently. Smart-SRA is composed of two phases. In the first phase,
the time criteria (page-stay and session duration) are applied for generating
shorter sequences from the raw input. In the second phase, maximal
sub-sessions are generated from the sequences produced in the first phase in
such a way that each pair of consecutive pages satisfies the topology rule. The
session duration limit is already guaranteed by the first phase. However, the
page-stay time must be checked again, since the pairs of consecutive web pages
formed in the first phase can change in the second phase.

The second phase adds the referrer constraints of the topology rule while
eliminating the need for inserting backward browser moves. This is achieved by
repeating the following steps until all pages in a candidate session obtained
after the first phase have been processed:

(1) The web pages without any referrers in the candidate session are
determined from the web topology.

(2) These pages are removed from the candidate session.

(3) They are appended to the previously constructed sessions, if there is a
hyperlink from the last page of a session to the new web page.

Considering the web topology given in Figure 1, for the candidate session
[P1, P20, P23, P13, P34] obtained after the first phase, Smart-SRA discovers
the sessions [P1, P20, P23] and [P1, P13, P34].
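The three steps of the second phase can be sketched as follows. This is a
simplified illustration rather than the full Smart-SRA procedure: the function
name `smart_sra_phase2` is hypothetical, the page-stay check is omitted, and the
link set below is an assumed fragment of the topology, chosen only so that it is
consistent with the example (Figure 1 itself is not reproduced here).

```python
def smart_sra_phase2(candidate, links):
    """Split a time-ordered candidate session into topology-consistent sessions.

    candidate: list of distinct page ids in timestamp order
    links:     set of (source, destination) hyperlink pairs
    """
    remaining = list(candidate)
    sessions = []
    while remaining:
        # Step 1: pages with no referrer among the earlier remaining pages.
        starts = [p for i, p in enumerate(remaining)
                  if not any((q, p) in links for q in remaining[:i])]
        # Step 2: remove these pages from the candidate session.
        remaining = [p for p in remaining if p not in starts]
        # Step 3: append them to sessions whose last page links to them.
        if not sessions:
            sessions = [[p] for p in starts]
            continue
        extended, used = [], set()
        for p in starts:
            for idx, session in enumerate(sessions):
                if (session[-1], p) in links:
                    extended.append(session + [p])
                    used.add(idx)
        # Sessions that could not be extended are kept unchanged.
        extended.extend(s for idx, s in enumerate(sessions) if idx not in used)
        sessions = extended
    return sessions

# Assumed links, consistent with the example above.
links = {("P1", "P20"), ("P20", "P23"), ("P1", "P13"), ("P13", "P34")}
result = smart_sra_phase2(["P1", "P20", "P23", "P13", "P34"], links)
```

Under these assumed links the sketch yields the two sessions of the example
without inserting any backward movement.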



2.3 Agent Simulator 

It is not possible to use web server supplied real user navigation data for
evaluating and comparing different web user session reconstruction heuristics,
since not all of the actual user requests can be captured by processing
server side access logs. In particular, sessions containing access requests
served from a client's and/or proxy's local cache cannot be known or
predicted by a web server. Therefore, we have developed an agent simulator
that generates web agent requests by simulating an actual web user [5,4].

Our agent simulator first randomly generates a typical web site topology
and then simulates a user agent that accesses this domain from its client
site and navigates (randomly) in this domain like a real user. In this way,
we have full knowledge about the sessions beforehand, and later, when
we use a heuristic to process the user access log data to discover the sessions,
we can evaluate how successful that heuristic was in reconstructing the
known sessions. While generating a session, our agent simulator eliminates
from the generated log the web user navigations that are served via a client's
local cache. Since the simulator knows the full navigation history at the client
side, it can determine which navigation requests are served by the web server
and which are served from the client/proxy cache. Also, our agent simulator
knows which page is the actual referrer of any page requested from the server
(the referrer is the page containing the hyperlink that was followed to access
the new page).

The agent simulator produces an access log file at the server side containing
the page requests that are served by the web server. The sessions discovered
by a heuristic are compared with the original complete session file. For
example, consider an agent with the complete page sequences [P1, P20, P23]
and [P1, P13, P34] generated by the agent simulator, which correspond to
the real sessions. In the web server log, however, this sequence may appear
as [P1, P20, P23, P13, P34], since the browser of the client can provide the
movement from P23 to P13 through P1 using its local cache, which means that
the second request for page P1 will not be sent to the web server. In this
example, our agent simulator generates an agent acting as a web user who
requests pages P1, P20 and P23 consecutively, then returns backward to P1,
and requests page P13. Therefore, our agent simulator knows that the actual
referrer of P13 is P1. Finally, the user agent requests page P34 from page P13,
and thus the agent simulator generates the session [P1, P13, P34]. Heuristics
used to reconstruct user sessions are run on the server side log data, and
they construct candidate session sequences. These candidate sequences are
compared to the real session sequences in order to determine the accuracy
of the heuristics.

An important feature of our agent simulator is its ability to model dynamic
behaviors of a web agent. It simulates four basic behaviors of a web user.
These behaviors can be combined to construct more complex navigation
behaviors in a single session. The four basic behaviors are given below:

(1) A web user can start a session with any one of the possible entry pages of
a web site. This behavior includes a new page that is not referred to by
any other page accessed recently from the same domain.

(2) A web user can select the next page having a link from the most recently
accessed page.

(3) A web user can press the back button one or more times and thus select
as the next page a page having a link from any one of the previously
browsed pages (i.e., pages accessed before the most recently accessed
one).

(4) A web user can terminate his/her session.

The agent simulator also uses time considerations while simulating the
behaviors described above. In the second and the third behaviors, the time
difference between two consecutive page requests is smaller than 10 minutes.
Also, in these behaviors, the differences between the access times of the next
page and the current page follow a normal distribution for each behavior type.
The median value of the page stay time is taken as 2.12 minutes (from [23]), and
the standard deviation is taken as 0.5 minutes.
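One simple way to draw such page stay times is sketched below. The function
name `sample_page_stay`, the use of rejection sampling to enforce the 10 minute
limit, and the seed are assumptions of this sketch; the simulator itself may
generate the values differently.

```python
import random

def sample_page_stay(mean=2.12, std=0.5, max_gap=10.0, rng=random):
    """Draw one page stay time in minutes from a normal distribution.

    Values that are not positive or that reach the 10 minute limit are
    rejected and redrawn (an assumption of this sketch).
    """
    while True:
        stay = rng.gauss(mean, std)
        if 0.0 < stay < max_gap:
            return stay

# Seeded generator so that the illustration is repeatable.
rng = random.Random(42)
stays = [sample_page_stay(rng=rng) for _ in range(1000)]
```

With a standard deviation of 0.5 minutes, rejected draws are extremely rare, so
the sampled values remain very close to the intended normal distribution.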

The four basic behaviors given above are implemented in our agent
simulator. Also, the following parameters are used for simulating the navigation
behavior of a web user.

Session Termination Probability (STP): The termination probability increases
as the length of a user session increases. The probability of terminating a
session at the n-th request is defined as $1 - (1 - \mathrm{STP})^n$.

Link from Previous pages Probability (LPP): LPP is the probability of
referring to the next page from one of the previously accessed pages except the
most recently accessed one. This parameter is used to allow the generation
of backward movements from the browser.

New Initial page Probability (NIP): NIP represents the probability of
selecting one of the starting pages of a web site during the navigation, thus
starting a new session.
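The effect of the STP formula can be made concrete with a small calculation.
The sketch below only evaluates the expression given above; the function name
`termination_probability` and the chosen values of n are illustrative.

```python
def termination_probability(stp, n):
    """Probability of terminating a session at the n-th request: 1 - (1 - STP)^n."""
    return 1.0 - (1.0 - stp) ** n

# With STP = 5%, the termination probability grows with the session length.
for n in (1, 5, 10, 20):
    print(n, round(termination_probability(0.05, n), 4))
```

For STP = 5% the value is 0.05 at the first request and about 0.40 at the tenth
request, which is why larger STP values lead to shorter sessions.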
2.4 Performance Evaluation of Session Reconstruction Phase

The most important performance criterion of the session reconstruction
heuristics is the accuracy of the constructed sessions. The agent simulator can
be used to measure the accuracy of the session reconstruction phase. We
simply define the accuracy of a heuristic as the ratio of the number of
correctly reconstructed sessions to the number of real sessions generated by
the agent simulator.

A reconstructed session is correct if it captures a real session. We assume
that a session H, reconstructed by a heuristic, captures a real session R if R
occurs as a sub-session of H. A session P with length n is a sub-session of
a session S with length m (denoted as $P \subset S$) if there is an index k of
S, such that $1 \le k \le m$ and $k + n - 1 \le m$, that satisfies the
following:

$S_k = P_1,\ S_{k+1} = P_2,\ S_{k+2} = P_3,\ \ldots,\ S_{k+n-1} = P_n$

For example, if R = [P1, P3, P5] and H = [P9, P1, P3, P5, P8], then
$R \subset H$, since P1, P3 and P5 are elements of H and they appear
consecutively in the same order. On the other hand, if H = [P1, P9, P3, P5, P8],
then $R \not\subset H$, because P9 interrupts R in H. Searching for real
sessions in candidate sessions produced by heuristics can be done by using a
simple algorithm adapted from an ordinary string searching algorithm.
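The sub-session test and the accuracy ratio can be sketched as follows. The
function names are hypothetical, and the accuracy function reflects one reading
of the definition above: a real session counts as correctly reconstructed when
at least one reconstructed session captures it.

```python
def is_subsession(pattern, session):
    """True if `pattern` occurs in `session` at consecutive positions."""
    n, m = len(pattern), len(session)
    return any(list(session[k:k + n]) == list(pattern)
               for k in range(m - n + 1))

def reconstruction_accuracy(real_sessions, reconstructed_sessions):
    """Fraction of real sessions captured by at least one reconstructed session."""
    if not real_sessions:
        return 0.0
    captured = sum(
        any(is_subsession(real, rec) for rec in reconstructed_sessions)
        for real in real_sessions
    )
    return captured / len(real_sessions)

# The example from the text, with pages written as integers.
R = [1, 3, 5]
captured_case = is_subsession(R, [9, 1, 3, 5, 8])     # R is a sub-session
interrupted_case = is_subsession(R, [1, 9, 3, 5, 8])  # P9 interrupts R
```

This naive scan compares every window of the candidate session; a linear-time
string searching algorithm could replace it without changing the result.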

Our agent simulator first generates a web domain, and then it produces
simulated sessions and a corresponding web log file containing client requests
for web pages. Then, a reconstruction heuristic processes this log file and
generates candidate sessions. After that, the accuracy of the heuristic can
be determined by using the reconstructed sessions and the original simulated
sessions.

As mentioned above, the accuracy of session reconstruction heuristics can
be calculated with respect to 3 parameters, namely STP, LPP, and NIP. For
evaluating the accuracy of different heuristics, random web
sites and web agent navigations are generated by using the parameters
given in Table 1. The number of web pages in a web site and the average
out-degree of the pages (number of links from one page to
other pages in the same site) are taken from the source given in footnote 2.
Varying values of the three parameters defined in the previous section, namely
STP, LPP, and NIP, are used for comparing the performances of the heuristics.

In our experiments, we have fixed two of these parameters and obtained
performance results for the third parameter. Therefore, three sets of
experiments are performed. In the first experiment, LPP and NIP are fixed at
30%, and STP varies from 1% to 20%. In the second experiment, LPP
varies from 1% to 90%, STP is fixed at 5%, and NIP is fixed at 30%.
Similarly, in the third experiment, NIP varies, and STP and LPP are fixed
at 5% and 30%, respectively.

Table 1
Agent simulator parameters

Parameter                                          Value or range
Average number of web pages (nodes) in topology    300
Average out-degree                                 15
Average page stay time                             2.2 min
Deviation of page stay time                        0.5 min
Number of agents                                   10000
Session Termination Probability (STP)              Fixed: 5%, Varying: [1%, 20%]
Link from Previous pages Probability (LPP)         Fixed: 30%, Varying: [0%, 90%]
New Initial page Probability (NIP)                 Fixed: 30%, Varying: [0%, 90%]

2 http://www.sims.berkeley.edu/research/projects/how-much-info/internet/rawdata.html

In the first experiment, an increase in STP leads to sessions with fewer pages.
The accuracy is higher for shorter sessions. If the navigation is affected by
LPP and NIP, then the session becomes more complex. If there is no return
to an already visited page and there is no new initial page, then the
session is simple and can easily be captured. So, in contrast to STP,
increasing NIP and LPP decreases the accuracy. The accuracies
of 4 heuristics (limited total session time: TO1, limited page stay time: TO2,
navigation oriented: NO, and Smart-SRA: SSRA) for various parameters are
given in Figures 2, 3 and 4. As can be seen from these figures, Smart-SRA
outperforms the previous heuristics (see [5,4] for more details).



Fig. 2. Reconstructed session accuracy for varying STP

Fig. 3. Reconstructed session accuracy for varying LPP

3 Discovering Patterns

3.1 Sequential Apriori Technique

A modified version of the classical apriori [2] technique was used for
discovering the frequent user access patterns from the reconstructed maximal
sessions. Unlike the ordinary large itemset discovery problem, in the user
web access pattern discovery problem, consecutive pages in the discovered
pattern should also appear in consecutive positions in the reconstructed
sessions supporting the pattern. Therefore, frequent web access patterns can
be obtained from reconstructed sessions by using a more efficient and
simplified version of the apriori technique. A session S supports a pattern P
if and only if P is a sub-session of S ($P \subset S$). We call the set of all
sessions supporting a pattern its support set. That is, a reconstructed session
$S \in \mathrm{SupportSet}(P)$ if $P \subset S$.

Sequential Apriori Algorithm (Algorithm 1): In the beginning, each
page with sufficient support forms a length-1 supported pattern. Then, in
the main step, for each k value starting from 1 and below the maximum
reconstructed session length, supported patterns (patterns satisfying the
support condition) with length k+1 are constructed by using the supported
patterns with length k and length 1 as follows:

• If the last page of the length-k pattern has a link to the page of the
length-1 pattern, then a length-(k+1) candidate pattern is generated by
appending that page.

• If the support of the length-(k+1) pattern is at least the required
support, it becomes a supported pattern. In addition, the new length-(k+1)
pattern becomes maximal, and the extended length-k pattern and the
appended length-1 pattern become non-maximal.

• If the length-k pattern obtained from the new length-(k+1) pattern by
dropping its first element was marked as maximal in the previous
iteration, it also becomes non-maximal.

• At some k value, if no new supported pattern is constructed, the iteration
halts.

Notice that in the sequential apriori algorithm, the patterns with length k
are joined with the patterns with length 1 by considering the topology rule.
This step eliminates many unnecessary candidate patterns before their supports
are even calculated, and thus increases the performance
drastically. In addition, since the definition of the support automatically
enforces the timestamp ordering rule through the sub-session check, all
discovered patterns satisfy both the topology and the timestamp rules, which
are very important in web usage mining.

An auxiliary function Support(I, S) is used to determine whether a given
pattern I has sufficient support from the given set of reconstructed user
sessions S. The support of a pattern I is defined as the ratio of the number of
reconstructed sessions supporting the pattern I to the number of all sessions:

$$\mathrm{Support}(I, S) = \frac{|\{S_i \mid S_i \in S \text{ and } I \text{ is a substring of } S_i\}|}{|S|} \qquad (1)$$
Algorithm 1 Sequential Apriori

input:  Minimum support frequency: δ
        Reconstructed sessions: S
        Topology information as matrix: Link
        The set of web pages: P
output: Set of maximal frequent patterns: Max

procedure sequentialApriori(δ, S, Link, P)
    {N denotes the maximum reconstructed session length}
    L_1 := {}    {set of frequent length-1 patterns}
    for i := 1 to |P| do
        if Support([P_i], S) ≥ δ then
            L_1 := L_1 ∪ {[P_i]}
        end if
    end for
    for k := 1 to N-1 do
        if L_k = {} then
            exit the loop    {no supported pattern left to extend}
        else
            L_{k+1} := {}
            for each I_i ∈ L_k
                for each P_j ∈ P
                    if Link[LastPage(I_i), P_j] = true then
                        T := I_i • P_j    {append P_j to I_i}
                        if Support(T, S) ≥ δ then
                            T.maximal := TRUE
                            I_i.maximal := FALSE    {since extended}
                            V := [T_2, T_3, ..., T_|T|]    {drop first element}
                            if V ∈ L_k then
                                V.maximal := FALSE
                            end if
                            L_{k+1} := L_{k+1} ∪ {T}
                        end if
                    end if
                end for each
            end for each
        end if
    end for
    Max := {}
    for k := 1 to N do    {L_N may be filled when k = N-1}
        Max := Max ∪ {I | I ∈ L_k and I.maximal = true}
    end for
end procedure
Table 2
Reconstructed Sessions Database

Session Id    Session
1             [P1, P13, P49, P23]
2             [P1, P13, P34, P23]
3             [P1, P13, P49]
4             [P1, P20, P23]
5             [P13, P49]



Table 3
Patterns Generated at Each Iteration

Step  Patterns                       Frequencies                 Support check
1     {[P1], [P13], [P23], [P49]}    {0.80, 0.80, 0.60, 0.60}    ≥ 0.40
      {[P20], [P34]}                 {0.20, 0.20}                < 0.40
2     {[P1,P13], [P13,P49]}          {0.60, 0.60}                ≥ 0.40
      {[P49,P23]}                    {0.20}                      < 0.40
3     {[P1,P13,P49]}                 {0.40}                      ≥ 0.40
      {[P13,P49,P23]}                {0.20}                      < 0.40



Let the list of sessions in Table 2 be generated by some session reconstruction
heuristic from the server logs. Let δ = 0.40 be taken as the minimum support
for the Sequential Apriori algorithm. Then, the execution of the sequential
apriori technique generates patterns with their frequencies in three
iterations, as shown in Table 3. In this table, the patterns whose support is
below 0.40 are eliminated due to their insufficient support. Since at iteration
4 there are no remaining frequent patterns, the algorithm stops. The only
maximal frequent pattern is [P1, P13, P49] with support 0.40.
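The worked example can be reproduced with the following Python sketch of the
sequential apriori technique. Several points are assumptions of this sketch and
not statements of the paper: the function names, the representation of the
topology as a set of link pairs, the particular links (chosen only to be
consistent with the sessions of Table 2), the use of "at least" for the support
test, and the choice not to mark length-1 patterns as maximal at initialization,
which matches the outcome of the example.

```python
def is_subsession(pattern, session):
    """True if `pattern` occurs in `session` at consecutive positions."""
    n = len(pattern)
    return any(tuple(session[k:k + n]) == tuple(pattern)
               for k in range(len(session) - n + 1))

def support(pattern, sessions):
    """Fraction of sessions that contain `pattern` as a sub-session."""
    return sum(is_subsession(pattern, s) for s in sessions) / len(sessions)

def sequential_apriori(min_support, sessions, links, pages):
    """Return the maximal frequent patterns as lists of page ids."""
    sessions = [tuple(s) for s in sessions]
    max_len = max(len(s) for s in sessions)
    # Each level maps a pattern (tuple) to its `maximal` flag.
    levels = [{(p,): False for p in pages
               if support((p,), sessions) >= min_support}]
    for _ in range(1, max_len):
        current = levels[-1]
        if not current:
            break
        next_level = {}
        for pattern in list(current):
            for page in pages:
                # Topology rule: extend only along existing hyperlinks.
                if (pattern[-1], page) not in links:
                    continue
                candidate = pattern + (page,)
                if support(candidate, sessions) >= min_support:
                    next_level[candidate] = True
                    current[pattern] = False           # extended
                    if candidate[1:] in current:       # drop first element
                        current[candidate[1:]] = False
        levels.append(next_level)
    return [list(p) for level in levels
            for p, maximal in level.items() if maximal]

# Sessions of Table 2 and an assumed set of links consistent with them.
sessions = [["P1", "P13", "P49", "P23"], ["P1", "P13", "P34", "P23"],
            ["P1", "P13", "P49"], ["P1", "P20", "P23"], ["P13", "P49"]]
links = {("P1", "P13"), ("P1", "P20"), ("P13", "P49"), ("P13", "P34"),
         ("P49", "P23"), ("P34", "P23"), ("P20", "P23")}
pages = ["P1", "P13", "P20", "P23", "P34", "P49"]
maximal_patterns = sequential_apriori(0.40, sessions, links, pages)
```

Because candidates are generated only along hyperlinks, the support of most page
combinations is never computed, which is the pruning effect described earlier.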

3.2 Performance of Sequential Apriori Technique 

Table 4
Parameters Used for Modeling Web Users

Experiment No    STP     LPP     NIP
1                0.10    0.20    0.20
2                0.10    0.20    0.40
3                0.10    0.40    0.20
4                0.10    0.40    0.40
5                0.20    0.20    0.20
6                0.20    0.20    0.40
7                0.20    0.40    0.20
8                0.20    0.40    0.40

In this subsection, we experimentally determine the accuracies of the
maximal frequent patterns generated by the sequential apriori technique using
the sessions reconstructed by different session reconstruction heuristics.
After the reconstruction of the sessions, they are processed by the sequential
apriori algorithm in order to discover the frequent patterns in these
sessions. The sequential apriori technique is also applied to the actual
sequences generated by the agent simulator. Since we know the frequent maximal
patterns of the sequences of the agent simulator ($MP_A$), which correspond
to the correct frequent patterns, we can determine the accuracies of different
heuristics ($A_H$ stands for the accuracy of a heuristic H) by using the maximal
frequent patterns generated by these heuristics ($MP_H$) as follows:

$$A_H = \frac{|MP_A \cap MP_H|}{|MP_A|} \qquad (2)$$
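The pattern accuracy measure can be sketched with set operations. The function
name `pattern_accuracy` is hypothetical, and the denominator used here, the
number of correct maximal frequent patterns, is an assumption of this sketch
made by analogy with the session accuracy of Section 2.4.

```python
def pattern_accuracy(real_patterns, heuristic_patterns):
    """Share of the correct maximal frequent patterns found by a heuristic."""
    real = {tuple(p) for p in real_patterns}
    found = {tuple(p) for p in heuristic_patterns}
    if not real:
        return 0.0
    return len(real & found) / len(real)

# Hypothetical pattern sets used only for illustration.
mp_a = [["P1", "P13", "P49"], ["P1", "P20"]]
mp_h = [["P1", "P13", "P49"], ["P13", "P34"]]
accuracy = pattern_accuracy(mp_a, mp_h)
```

In this illustration one of the two correct patterns is recovered, so the
accuracy is 0.5.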



In our experiments, we have studied the accuracy by varying 4 different
parameters, namely STP, LPP, NIP and the support. In each experiment we
have fixed the three parameters (STP, LPP, and NIP) which are used to define
the behavior of an agent, and obtained the accuracies for varying support
values. For each one of these parameters, we have used two typical values,
namely 0.10 and 0.20 for STP, and 0.20 and 0.40 for LPP and NIP. Also, in each
experiment the support is varied from 0.05% to 0.25%. Table 4 summarizes
the parameters used in these experiments. Among these settings, the
4th experiment gave the lowest and the 5th experiment gave the highest
accuracy results for all heuristics and support values.

The important result of these experiments is the large improvement of the
accuracies of the frequent patterns in the second phase. As can be
seen from Figures 5 and 6, corresponding to the 4th and the 5th experiments,
respectively, the accuracy of the second phase is always much higher than
the accuracy of the reconstructed sessions. It is also observed that the
accuracy of the discovered frequent maximal patterns is about 30% higher when
Smart-SRA is used. Similar results were also obtained for the other 6
experiments.

Fig. 5. Support vs. Accuracy (Experiment no 4)

Fig. 6. Support vs. Accuracy (Experiment no 5)

This result is not surprising because in the first phase real sessions must
appear as sub-sessions of reconstructed sessions in order to be considered
as correct, whereas in the second phase only frequent real
patterns must appear as sub-sessions of frequent reconstructed sessions in
order to be considered as correct. Thus, it is more likely to obtain higher
accuracy in the second phase. In the session reconstruction phase, for 100%
accuracy of any pattern, it must appear in the reconstructed sessions for
each of its occurrences in the actual sessions. On the other hand, for the
frequent pattern discovery phase, it is sufficient if the pattern appears as
many times as is required by the support value. Therefore, for frequent
patterns we obtain much higher accuracies.



4 Conclusion and Future Work 

In this paper we have introduced a new frequent web usage pattern
discovery method. Frequent patterns are discovered among the reconstructed
sessions. Sessions can be reconstructed by using various heuristics. In our
experiments we have used a recently developed heuristic, Smart-SRA, as well as
time- and navigation-oriented heuristics for this first phase. Then, we have
used a newly proposed sequential apriori technique in order to discover
frequent patterns in the set of reconstructed sessions.

In the session reconstruction phase, for a reconstructed session to be
regarded as accurate, it must include a session generated by the agent
simulator. A frequent pattern, on the other hand, is accepted as accurate if it
merely appears as a frequent pattern in the reconstructed sessions. Therefore,
the accuracy increases after the pattern discovery phase, since complex
navigational behaviors, which are hard to discover but also infrequent, are
eliminated at this phase.

The main purpose of WUM is to extract useful user navigation patterns.
Therefore, it is not sufficient only to reconstruct user sessions from server
logs. Capturing frequent user navigation patterns is more significant. This
is achieved by employing frequent pattern discovery techniques after
sessions are reconstructed. It is important to observe that the success of the
session reconstruction phase of WUM affects the success of the frequent pattern
discovery phase. Moreover, our experiments also show that regardless of
which heuristic is used for session reconstruction, the accuracy always
increases in the frequent pattern discovery phase. Also, by adjusting the
support parameter of the apriori technique it is possible to control the
frequency requirement of the common patterns searched in user navigation
behaviors as well as the number of patterns discovered.

As future work, modifying the SRA heuristic for proactive session
reconstruction can be considered by using other additional information sources.
Also, our agent simulator can be improved in order to represent user navigation
behaviors more correctly by adding new features.